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Abstract 

Computing on graphics processors is maybe one of the most important developments in computational science to happen in 
decades. Not since the arrival of the Beowulf cluster, which combined open source software with commodity hardware to truly 
democratize high-performance computing, has the community been so electrified. Like then, the opportunity comes with challenges. 
, The formulation of scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking 
core methods. Here, we have tackled fast summation algorithms (fast multipole method and fast Gauss transform), and applied 
f^) algorithmic redesign for attaining performance on gpus. The progression of performance improvements attained illustrates the 
exercise of formulating algorithms for the massively parallel architecture of the gpu. The end result has been gpu kernels that run at 
5— i over 500 Gop/s on one nvidia tesla C1060 card, thereby reaching close to practical peak. 
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1. Introduction 

The reality of the past few years is that the computing indus- 
try is betting everything on parallel computing, with first the 
multi-core era, and now a many-core trend [1]. The implication 
of this trend is that it will not be enough to develop applications 
for quad- or eight-core systems, but rather for tens and hundreds 
of processor systems. The unwelcome news that industry play- 
ers are delivering is that many-core architectures require "go- 
ing back to the algorithmic drawing board"[2]. Incrementally 
thinking about "parallelizing" an existing serial formulation of 
an algorithm will not work for the many-core architecture. We 
have to rethink the logic of the formulation from the bottom up, 
and recast the algorithm starting from the mathematics. 

Concurrently to the changing situation for cpu technology in 
recent years, new prominent hardware architectures have come 
into play. Attracting a tremendous amount of the attention is 
the programmable gpu (graphics processing unit), with its dra- 
matically faster progress compared to cpus. Figure 1 shows 
the computing capacity, measured in Gigaflop/s, of subsequent 
generations of Intel cpus and nvidia gpus, as shown in [3] and 
oft-times cited since. The peak performance of the latest gpu 
is several times greater than the latest cpu, but more important 
is the trend. In addition to the raw performance, gpus have ex- 
cellent performance-per-watt. As high-performance computing 
centers are more and more burdened by high power costs, the 
energy efficiency of the gpu makes it an irresistible choice. The 
price/performance ratio, finally, makes the gpu architecture a 
winner in various settings. Given that the market for gpus was 
driven by the games industry, it is commodity hardware and 
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Figure 1 : Evolution of the computing capacity of several generations of cpus 
and gpus. Adapted from [3]. 

thus cheaper than alternatives for high-performance computing 
based on 'mainframe' servers. 

The real opportunity for using gpus in scientific comput- 
ing came about with the release of cuda, a C-language exten- 
sion that gave programmers relatively easy access to the gpu 
for coding general algorithms (first publicly released by nvidia 
in February 2007). However, there are significant challenges 
as the specialized hardware architecture again forces us to go 
"back to the algorithmic drawing board" [2]. Indeed, there is 
an immediate need for research into algorithms so that we can 
exploit the performance offerings of the gpu architecture. This 
is the motivation of the present work. 

In scientific computing, there are multiple algorithms that 
pervade through a variety of scientific or engineering applica- 
tions. Here, we focus on two fast summation algorithms — fast 
multipole method, fmm, and fast Gauss transform, fgt — and 
present examples of the workflow to produce a formulation of 
these algorithms that will maximize the performance on the gpu 
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hardware architecture. The algorithms selected in this work 
program constitute enabling mathematical technology used in 
many numerical methods. These, in turn, are key tools for en- 
abling specific application areas in science and engineering. 

The focus of our research in regards to the applications is 
in large-scale computational fluid dynamics (cfd) using parti- 
cle methods, and radial basis function (rbf) methods. For in- 
stance, consider the vortex particle method in cfd, where one 
has a large number of particle-like objects that are used for the 
mesh-free discretization of the Navier-Stokes equation in vor- 
ticity formulation; see [4] for details of the method. The evalua- 
tion of the velocity field in this formulation results in an A^-body 
problem. In general, the A^-body problem consists of evaluating 
all the pair-wise interactions of a set of N bodies due to a force 
potential, given by: 

N 

f(x j ) = J]c i K(xj < xd. (1) 

i=l 

Here, {(c,-, x,)} represents the set of N source particles with each 
particle i having a coordinate x, and a weight c,; and K.(xj,xi) 
represents the kernel function that defines the interactions be- 
tween two particles. The direct evaluation of Equation 1 at all 
N particle locations has a computational complexity of 0(N 2 ). 
It is often the case in problems of scientific interest that the set 
of particles used in simulations is in the order of thousands to 
millions. For such large simulations, the direct evaluation of 
the particle interactions would take an unreasonable amount of 
computational time. This is one of the main motivations for 
the use of fast summation methods that are able to accelerate 
the evaluation of the particle interactions. Such methods re- 
duce the computational complexity of the A^-body problem to 
0(N log AO or O(N), by rapidly computing and aggregating ap- 
proximated pairwise particle interactions. 

Perhaps the most renowned of the fast A^-body methods is 
the fast multipole method, fmm, discovered by Greengard and 
Rokhlin [5]. Its applications have included long-range electro- 
statics in dna simulations [6], calculations of vortex sheet evolu- 
tion [7], room acoustics and scattering from multiple bodies by 
means of the Helmholtz equation [8], and many others. We will 
describe the basic aspects of the fmm algorithm in the next sec- 
tion, but in a necessarily brief fashion. Another fast algorithm, 
which specializes the fmm to the case when the kernel function 
in (1) is a Gaussian function, is the fast Gauss transform, fgt 
[9]. This paper covers both the fmm and the fgt, and their im- 
plementation on the gpu architecture for achieving maximum 
performance. After giving a background on the two algorithms 
that we treat, a motivation for gpu computing follows in §3. The 
optimization of the algorithmic kernels is described in detail in 
§4, followed by details of the implementation and discussion of 
the performance obtained (§5) and final concluding remarks. 

2. Background on fast summation methods 

Of central importance to fast summation methods is the idea 
that when evaluating pairwise interactions, the solution is only 
required up to a given accuracy. This idea opens the door to 



methods that use efficient schemes to rapidly compute an ap- 
proximation to the interactions, as an alternative to directly 
evaluating them. It can be said that fast summation methods 
trade numerical accuracy for computational speed. In this work, 
we focus on two methods that are classified as analytic -based 
fast algorithms, as they use analytic approximations to obtain 
speed and arbitrary accuracy: the Fast Multipole Method, fmm, 
and the closely-linked Fast Gauss Transform, fgt. 

2.1. Fast multipole method 

The fmm was developed to calculate the pairwise interactions 
of large sets of particles. Whenever the physical interactions of 
interest have long-range effects {e.g., gravitational and electro- 
static problems), evaluating the interaction among N particles 
in principle requires 0(N 2 ) operations. The fmm of Greengard 
and Rokhlin [5] reduces the computational complexity of this 
problem to O(N) operations by approximating and aggregating 
the pairwise interactions. A detailed description of the fmm al- 
gorithm is outside the scope of this paper. We will only give a 
brief description of the method and outline the most important 
steps of the algorithm. For a formal and more detailed descrip- 
tion of the fmm algorithm, the reader can consult the original 
reference [5], or other surveys such as [10]. 

In a nutshell, the fmm algorithm accelerates the evaluation of 
all the pairwise particle interactions by approximately evaluat- 
ing them on clusters of particles. At the core of the algorithm, 
a hierarchical decomposition of space is used to group parti- 
cles into clusters at different scales, as shown in Figure 2. In 
the fmm, the influence of a cluster of particles is approximately 
represented by a Multipole Expansion (me), which is an infinite 
series expansion. The me is truncated after a given number of 
terms, enabling the control of the accuracy of approximation. 
By using the me, it is possible to evaluate a distant cluster of 
sources at once and as a whole. The use of the hierarchical 
decomposition of space and the mes reduce the computational 
complexity of the algorithm to 0(N log AO. To further reduce 
the computational complexity to O(N), the fmm introduces the 
concept of Local Expansions (les). An le is a series expansion 
that converges in a local domain. It is used to evaluate the ag- 
gregated effect of the mes of a group of clusters within the local 
domain. By converting all the mes that represent the far-field 
into a single le, the effects due to interactions with particles in 
the far-field can be obtained in a single evaluation. Finally, to 
fully evaluate the domain, the contribution of the interactions 
with the particles in the near-field are computed directly. 

The fmm can be organized into 5 stages: 

1. Initialization — hierarchical decomposition of the space 
and generation of clusters of particles. 

2. Upward sweep — creation of mes for all clusters. 

3. Downward sweep — transformation of mes into les, and 
aggregation of les. 

4. Calculation of the far-field — le evaluation. 

5. Calculation of the near-field — direct evaluation with par- 
ticles in the near-field. 
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Figure 2: The computational domain is hierarchically decomposed into areas that in turn are used to cluster the particles. Using the hierarchical decomposition, the 
near-field and far-field for a given target domain are identified. The hierarchy is used to find spatial relations between clusters at different levels. The combination 
of clusters at different levels are used to approximate the far-field interactions. 



Of the stages enumerated above, the most computationally 
intensive stages are the downward sweep and the calculation of 
the near-field. In many experiments with a serial implemen- 
tation of the fmm, we observed that these two stages can take 
around 99% of the overall computing time for N in the order 
of 10 million, and around 95% for N in the order of 4 mil- 
lion. In a parallel implementation, some overheads noticeably 
appear which reduce the fraction of these stages with respect 
to the total runtime, although they still substantially dominate. 
Complete timing breakdowns of a parallel fmm algorithm de- 
veloped in our group are given in [1 1]. In the present paper we 
will address the gpu implementation of one of the two dominat- 
ing stages of the overall algorithm: the downward sweep. The 
second one, the calculation of the near-field, is a pleasantly par- 
allel problem and it has already been addressed in [12, 13]. We 
would like to note here also that a first publication of work on 
implementing the full fmm algorithm for gpus was presented in 
[14]. 

2.2. Fast Gauss transform 

The fast Gauss transform (fgt) was originally proposed by 
Greengard and Strain [9], as a specialization of the fmm, for the 
case when the interaction function K from Equation (1) takes 
the form of a Gaussian kernel, as follows: 

G(y) = 2 Qi ex P I ^2 I 

Similarly to the fmm, the fgt makes use of analytical approx- 
imations to reduce the computational complexity from 0(N 2 ) 
to 0{N). The fgt relies on two simple but powerful ideas: 
first, that Gaussians at a large distance from an evaluation point 
can safely be ignored with no detriment to accuracy; and sec- 
ond, that multiple individual Gaussians can be combined and 
approximated using a single series expansion. As with many 
other fast summation methods, such as the fmm, the compu- 
tational domain is decomposed in order to leverage these two 
ideas. In the particular case of the fgt, a decomposition based 



on uniform box clusters is used, but the algorithm is agnos- 
tic towards the clustering scheme, a useful property in higher 
numbers of dimensions where simple boxes or cubes may be 
inefficient. Once the computational domain has been divided, it 
is necessary to choose what kind of approximation will be used 
to calculate the potential between any given pair of source and 
target clusters. There are four distinct options, chosen by work 
estimates and outlined below; they are illustrated graphically in 
Figure 2.2. 

1 . Direct evaluation — Use (2) for direct evaluations between 
individual source and target particles. 

2. Hermite series evaluation — Generate Hermite expansions 
from sources around source centers and evaluate against 
individual target particles. 

3. Taylor series evaluation — Generate Taylor expansions 
from sources around target centers and evaluate against 
target particles. 

4. Hermite to Taylor series translation — Generate Hermite 
series around source centers, then translate these series to 
Taylor series around target centers. Finally evaluate these 
Taylor series at target particles. 

We will briefly describe each of these methods in turn, and 
the situations in which they are used. For a more detailed de- 
scription of the algorithm, we refer the reader to the original 
reference [9]. The first form of interaction, the direct evalua- 
tion, corresponds to applying the original sum of Equation (2). 
However, rather than evaluating all source points in the com- 
putational domain against a given target sub-domain, only the 
sources within a local sub-domain are considered — by doing 
so, the entire sum does not devolve into 0(N 2 ) complexity. For 
this reason, direct evaluation is only used where the number of 
both sources and targets is small. 

In cases where there are a large number of source and target 
particles, one uses a more efficient way of calculating the po- 
tential than direct evaluation, based on series expansions. There 
exist three different interaction approximations that can be used 
for this purpose, these are: Hermite series (3), Taylor series 
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Figure 3: Illustration of the 4 main interactions in the fgt as previously enumerated. Source clusters are on the left, target clusters on the right. Arrows indicate 
direction of interaction. 



(4), or an approach that combines the use of both by translat- 
ing a pre-existing Hermite series into a Taylor series around a 
different point (5). In practice, the approximation to be used de- 
pends on the computational efficiency of a given situation. For 
all series, the accuracy can be controlled by choosing a pair of 
multi-index variables, a and ft, with h n (t) = e H n (t) and H„ 
denoting the «th Hermite polynomial. Furthermore, we use the 
relation H a (t) - H ai (£)■■■ H ad (t) with the of-dimensional multi- 
variate indices a and ft. Other relevant properties of multi- 
variate indices used in the fgt are as follows: a\ - a\\ ■ ■ ■ ay!, 
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(3) 

(4) 
(5) 



The first series (3) is used in the case where there are many 
source points but comparatively few targets. In this case, a Her- 
mite series is created about a cluster center, s B , with coefficients 
A a . The second alternative is a Taylor series about the center of 
the target cluster, tc- This is used when there are a large num- 
ber of target points, but comparatively few sources. This en- 
sures that time is not wasted generating expensive Hermite ex- 
pansions when there are not enough sources to justify the cost. 
Finally, when there is a large number of both sources and tar- 
gets, one uses a combination of both types of series. First, a 
Hermite series is created about the center of the source cluster 



as before, then that series is translated into a Taylor series about 
the center of the target cluster, which can then be evaluated. 

Once all the necessary series and translations have been cre- 
ated, they must be evaluated to give the final potential using 
either (6) for Hermite series, or (7) for Taylor expansions. 



G(y) = YaJiJ--^) 



(6) 



(7) 



Using combinations of the different evaluation strategies one 
can approximately represent the potential field at the evaluation 
locations. All of the options for evaluating potentials that have 
been described above are combined to form a fast, efficient al- 
gorithm that runs with O(N) complexity. It is notable that all 
the formulae used are agnostic towards the number of dimen- 
sions of the Gaussian basis. Thus, the fgt is particularly useful 
in fields where problems of this kind occur in high dimensions, 
as in options pricing [15], information theory [16], image pro- 
cessing [17], and others. 

3. Motivation: the opportunities of the new hardware ar- 
chitectures 

We focus on the cuda architecture and programming frame- 
work from nvidia. That said, similar hardware produced by 
other vendors (such as amd and intel) as well as different pro- 
gramming frameworks (such as OpenCL), rely on the same 
technology paradigm of massively-parallel and throughput- 
optimized processors. Therefore, the principles that we discuss 
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here can be applied generally to current gpu technologies, inde- 
pendent of vendors. 

3.1. The cuda architecture and programming model 

Developed by nvidia, cuda is a parallel programming model, 
built for getting the most out of their gpu hardware architec- 
ture. This model can involve hundreds of processing cores, and 
many thousands of individual threads running at any given time. 
This massive amount of parallelism forces one to focus on de- 
veloping inherently parallel algorithms in order to get perfor- 
mance. On the other hand, the nvidia gpu architecture and cuda 
programming model can mitigate some of the complications of 
parallel programming, many being handled by the hardware and 
transparent to the programmer. 

In this programming model, a parallel program is executed 
as a kernel. These kernels are simply programs written in the C 
language that will be executed by a single thread. It is generally 
the case that a kernel will be executed by thousands of simulta- 
neous threads. Each of these threads is extremely lightweight, 
with no appreciable creation overhead. It also has a unique id 
number, with independent control flow and memory space. A 
further advantage is that active threads can be "hot-swapped" 
by the gpu. Thus, when a thread is stalled due to a memory 
access, it can be swapped to another thread with no extra cost. 
This, combined with the fact that optimal efficiency is gained 
with tens of thousands of threads, allows us to hide the sig- 
nificant memory latency inherent in the architecture. When a 
kernel is executed, it is run on a number of threads that is ex- 
plicitly defined by the user. On the gpu, the threads are handled 
as groups of warps (a 32-thread unit) and all the threads in a 
warp execute using a single-instruction/multiple-thread model 
(simt; see Table 3.1 for acronyms). Under the simt model, all 
threads will perform the same instruction and follow the same 
code execution path. Additionally, the simt model also allows 
the threads to branch in their execution but with a reduced exe- 
cution performance. 

To avoid unnecessary computations, the cuda model allows 
both shared results and shared memory accesses. These com- 
bine to reduce both redundant computations and redundant 
memory accesses. This kind of cooperation, however, does not 
scale well to the thousands of concurrent threads that we have 
already established as the key cuda methodology. The solution 
given by cuda is to limit co-operation to small groups of threads, 
called thread blocks. These blocks are scalable, and allow syn- 
chronization between their component threads. A thread block 
is assigned and executed by a single streaming-multiprocessor 
(sm). One important characteristic of this model, is that it does 
not allow for communication between different blocks. This 
grouping is the model used by cuda to scale on the hardware 
with different numbers of available sms. The more powerful 
gpus will have more sms available and can execute more thread 
blocks simultaneously. 

3.2. Key features of the gpu computing technology 

The essential hardware features of gpus are very different 
from cpus: they have many times more processor cores, and a 



totally different memory system. For example, the main charac- 
teristics of the nvidia gpu chip used in this study — the GT200, 
which is used in both the commodity GeForce GTX 2xx video 
cards and the Tesla CI 060 computing processor — are the fol- 
lowing 1 : 

> 240 'streaming processor' cores per gpu chip, clocked at 
1.296 GHz 

> cores grouped into 'streaming multi-processors', with 8 
cores each 

> each multi-processor has a shared memory of 16 kB 

Note the large number of processor cores, and the small amount 
of shared memory of the gpu. The gpu is specialized, highly- 
parallel hardware, appropriate for computationally-intensive 
tasks. To be specific, the gpu is a streaming, pipelined architec- 
ture. It is 'pipelined' because it is built for processing each of 
many objects in the same way (which is the situation in graph- 
ics rendering). The architecture is optimized for homogeneous 
units of work. It is 'streaming' because it allows a large number 
of simultaneous computational units, with synchronization and 
communication among these units being provided by the hard- 
ware. These fundamental architectural attributes make the gpu 
exceptionally suited to perform computations that have a high 
level of data parallelism. 

Now, let us consider the performance capability of the gpu 
described above. To satisfy the reader of the correctness of the 
marketed performance numbers for this chip, we have to go into 
a little more detail about the architecture than we would like to. 
But, if the reader will bear with us, this will prove revealing. 

The GT200 chip really has 10 processor clusters, or tpcs (see 
Figure 4 for a sketch of the gpu hardware architecture). Each of 
the tpcs has 3 spmt computing cores (sm), consisting of stream- 
ing processors (sp) and special-function units (for transcenden- 
tals, etc.). An sm can issue, in each cycle, instructions to either: 

> 8 single-precision fpus (stream processors) 

> 1 double-precision fpu 

> a branch unit that manages simt execution 

A streaming processor can perform a single precision multipli- 
cation and an addition (mad) in each cycle (2 flop/cycle), thus 
contributing the following to the performance: 

1.296 GHz x 10 tpc x 3 sm x 8 sp x 2 flop/cycle 

= 622.08 Gflop/s. 

The special function units, meanwhile, can compute 4 floating 
point operations per cycle in a vector instruction, and thus con- 
tribute the following to the performance: 

1.296 GHz x 10 tpc x 3 sm x 2 sfu x 1 x 4 flop/cycle 

= 31 1.04 Gflop/s. 



Most of the results in this paper, and all of the development, were made on 
Tesla-series GT200 gpus. During revision, we gained access to the newest gen- 
eration gpu (Fermi architecture) and thus we were able to add results obtained 
on this chip. The discussion in the text, however, does not emphasize Fermi. 
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Acronym 


Meaning 


SPMD 


Single Program, Multiple Data 


SPMT 


Single Program, Multiple Thread 


TPC 


Texture Processor Clusters 


SM 


Streaming Multi-Processors 


SP 


Streaming Processors 


SIMT 


Single-Instruction, Multiple Thread 


SFU 


Special Function Units 


FPU 


Floating-Point Units 



Table 1 : Acronyms for gpu architecture. 



The sum of the throughput that can be obtained from the stream- 
ing processors and the special function units is therefore 933 
Gflop/s, the advertised single-precision peak performance of 
the gpu. It must be noted that this peak assumes that instruc- 
tions are constantly being co-issued to the sps and sfus. When 
no special functions are required in an algorithm, it is possi- 
ble to use the special-function unit to execute instead single- 
precision multiplies, thus in theory the peak throughput is avail- 
able. However, arguments still arise when various researchers 
meet to discuss their results as percentage of "peak"; some con- 
sider the practical peak to be 622 Gflop/s, for example. In dou- 
ble precision, on the other hand, there is only one fpu available 
per sm and thus the peak performance of the gpu is: 1.296 MHz 
x 30 sm x 1 fpu = 78 Gflop/s (the Fermi architecture improved 
on this considerably). 

While this analysis focuses on floating point performance, 
we are just as interested in overall performance of the gpu, also 
taking into account integer operations. Thus, we wish to take 
our analysis based of Gflop/s and try to draw some conclu- 
sions about the general performance in Operations/s (Gop/s). 
From [18], we can obtain the throughput of different arithmetic 
functions for different datatypes for compute capabilities 1.3 
and 2.0. For 1.3-capable cards (such as the Tesla C1060), we 
see that integer multiply and mutiply-add are listed as 'multi- 
ple instructions', while for 2.0 compute capability (such as the 
Fermi C2050), the throughput is identical for that of equiva- 
lent floating point operations. From this, we conclude that for 
the CI 060 peak performance in Gop/s is slightly lower than 
the value determined above in Gflop/s, but for the C2050 cards, 
performance in floating point and integers should be equivalent. 

Clearly, there will be many challenges in a particular appli- 
cation for extracting the promised performance out of the gpu. 
The throughput capacity calculated above refers to the max- 
imum number of instructions that can be issued on the chip. 
Moreover, the actual throughput will also depend on issues re- 
lated to the access to memory, in particular the high-latency 
accessess to global memory from the gpu chip (where memory 
transactions range between 400 and 600 clock cycles), and the 
use of shared memory in the streaming multi-processors to mit- 
igate the high-latency effects. 

For the algorithms of interest in this work, we have devel- 
oped formulations that demonstrate the power of heterogeneous 
computing using gpus. The subsequent section will discuss the 
work flow that was followed for the formulation of the algorith- 



mic kernels in such a way as to maximize performance in the 
architecture. 

4. Formulation of the algorithmic kernels for the GPU and 
optimization 

When formulating algorithms for the gpu, one wants to have 
the means to analyze their performance and efficiency in terms 
of resource utilization for the target architecture. By perform- 
ing such analysis, we can predict the maximum theoretical per- 
formance of the algorithm on the given hardware, and use this 
information as a guide during the design and optimization of 
the algorithm. 

Before any analysis can take place, we need to build a sim- 
plified model of the gpu that characterizes all the important 
features of the architecture. From our introduction in sec- 
tion §3.2, a current-generation gpu can be described as a mas- 
sively parallel architecture that has been optimized for high 
throughput. In our simplified model, we consider each gpu to 
have hundreds of very simple processors (processing elements) 
that can execute concurrently. We will assume that each pro- 
cessing elements is capable of computing the following opera- 
tions, in single precision and at a constant cost: arithmetic op- 
erations (+, -, x, -r, transcendental functions), and conditional 
statements. Regarding the data movement operations (load, 
store, and copy), we will not take into account all the details of 
the gpu memory hierarchy but use a simplified model where we 
distinguish between off-chip memory (global memory) and on- 
chip memory (registers and shared memory) — the main con- 
siderations are the limited bandwidth available and limited ca- 
pacity, respectively. 

Using the simplified gpu model, we can estimate the opera- 
tion complexity and data accesses of an algorithm. We use this 
information to evaluate the algorithmic performance against the 
hardware theoretical throughput and bandwidth. 

> Throughput (measured in op/s): The average number of 
operations per second that can be executed by the gpu. A 
large number of op/s implies a high level of theoretical 
peak performance. Traditionally, this number is presented 
as Giga-op/s, abbreviated as Gop/s. Giga-flop/s or Gflop/s 
are also commonly presented, but using op/s gives a better 
idea of how close to the peak performance of the gpu an 
algorithm obtains. 
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Figure 4: Details of GT200 gpu architecture. 



> Bandwidth (measured in Giga-Bytes/sec, GB/s): The rate 
at which data is transferred between the memory and the 
processor. Bandwidth measures how much data is being 
read and written from global memory in a unit time. In 
practice, many applications are bandwidth-limited, there- 
fore it is important to use the bandwidth as efficiently as 
possible. 

In addition to our gpu model, we define four algorithmic 
properties that map to different features of the gpu paradigm. 
These properties are then used as guidelines when rethinking 
an algorithm: 

> Computational intensity: ratio of operations performed to 
memory accesses in a unit of time. A compute-intensive 
task will have an operations -to-memory access ratio larger 
than the one given by the theoretical performance of the 
gpu. Highly compute-intensive tasks benefit from the high 
throughput of the gpu, while low compute-intensive tasks 
benefit from the large memory bandwidth of the gpu. 

> Concurrency: parts of the algorithm that may be exe- 
cuted simultaneously. There are many levels of concur- 
rency that range from coarse-grained concurrency to fine- 
grained concurrency. The latest gpus can benefit from mul- 
tiple levels of concurrency. For instance, when using cuda- 
enabled devices, one is able to explicitly identify concur- 
rency at different levels, coarse- to fine-grained: cuda ker- 
nel, thread-block, and thread calculations. 

> Homogeneity: degree at which concurrent computations 
are the same, independently of their input. Concurrent 
tasks that present a high degree of homogeneity can take 
full advantage of the 'single-instruction /multiple-thread' 
simt model of the gpu architecture. 

> Data-locality: the way in which the physically stored data 
is accessed by an algorithm. The role of data locality 



is twofold: a high degree of spatial data locality can re- 
sult in more efficient data accesses, e.g., when data that 
is spatially adjacent can be accessed by thread collabora- 
tion, resulting in more efficient memory transactions. A 
high degree of temporal data locality avoids redundant 
data transfers of data values that are frequently accessed 
within small time windows. 

An ideal algorithm for the gpu would be computationally 
intensive and have a high degree of concurrency. Concurrent 
computations would be highly homogeneous and show a large 
amount of data locality. However, in practice these properties 
are difficult to measure and as such do not allow one to effec- 
tively quantify the expected performance of the algorithm under 
study. 

4.1. Fast multipole method for the gpu 

We will now concentrate on the algorithmic redesign of the 
fmm for gpu computing, whereas a more in-depth discussion 
of the implementation details takes place in §5. As such, we 
start by noticing that the most time-consuming stages of the fmm 
algorithm are: the near-field calculations, and the downward 
sweep. As we observed in §2.1, these stages can take 95 - 
99% of the runtime in a serial implementation of the algorithm, 
and continue to dominate in parallel implementations despite 
parallel overheads. 

Let us discuss in more detail the computations performed by 
these two stages. In the case of the near-field calculations, the 
problem is to compute the mutual interactions of particles in 
many small- to medium-size domains, with each domain con- 
taining at most a few hundred particles. Therefore, the near- 
field calculation problem can be reduced to efficiently imple- 
menting the direct particle interactions for each domain, and as 
such, this is a special case of the A^-body problem, where in- 
teractions are computed only with a subset of the N particles. 
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Furthermore, the near-field calculation retains the distinctive 
features of being computationally intensive and highly parallel. 
Therefore, this problem can be efficiently mapped to the gpu ar- 
chitecture, achieving high performance, as reported in [12, 13]. 
We note that, as pointed out by [12], a balanced fmm execution 
in the gpu will perform the near-field direct evaluation with a 
larger set of particles per box than in the cpu case. This dif- 
ference is due to the higher efficiency achieved by the gpu in 
the near-field calculations, as compared to the far-field calcula- 
tions. Consequently, the optimal workload distribution between 
these two stages is different in the gpu and cpu executions. 

In the downward sweep, the computations are dominated by 
the transformation of multipole expansions into local expan- 
sions (M2L step). Tens of thousands of concurrent M2L transfor- 
mations are computed in the fmm algorithm, this being the rea- 
son for such a high computational intensity. Each of these trans- 
formations can be expressed in the form of a matrix- vector mul- 
tiplication, where each matrix is dense, of size (p x p), where 
p corresponds to the number of terms of a truncated me. The 
algorithmic steps of the downward sweep are: 

1 . For every node of the hierarchical tree, we obtain the list 
of clusters that belong to the same level, effectively com- 
puting the interaction list of each node. 

2. Then for every node (or evaluation cluster) in the tree, the 
multipole expansions of each cluster in its interaction list 
are transformed into a local expansion centered at the eval- 
uation cluster. 

3. Finally, the cluster's local expansion is obtained by aggre- 
gating the local expansions obtained from the M2L trans- 
formation of all the clusters in the interaction list, this step 
is referred to as the reduction. 

We present the following example to illustrate the paral- 
lelism and computational intensity of the downward sweep 
stage. Consider a two-dimensional case (d = 2) where the com- 
putational domain is hierarchically decomposed by a uniform 
tree (a quadtree) with / = 5 levels of spatial refinement. Such 
a tree will approximately have 4' nodes, where each node will 
transform on average 27 multipole expansions (more precisely, 
this number corresponds to the size of the node's interaction 
list). Therefore, the approximate number of M2L transforma- 
tions computed are 4 5 x 27 = 27, 648. For deeper trees, more 
transformations would be needed. 

In the original algorithm for the M2L transformation, the mul- 
tipole expansion of a given cluster j, with center Xj and multi- 
pole coefficients rrij, is transformed into a local expansion for 
cluster i with center at x,. The result of the transformation will 
give the local expansion coefficients /,. In the 2-dimensional 
case, both the coordinates and expansion terms are usually rep- 
resented by complex numbers. The M2L transformation in its 
matrix-vector multiplication form has a transformation matrix 
A, with p x p complex elements, a n k- These elements can be 
obtained using the following expression: 
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Figure 5: Sketch of different strategies for M2L computation. 



As previously noted, the M2L stage is highly parallel and 
computationally intensive. However, in order to obtain a high- 
performance implementation for the gpu, the M2L stage needs to 
be reformulated. 

4.1.1. Reformulation of the M2L problem 

As a first approach, one can consider distributing the mat-vec 
operations across threads. That is, each cuDA-thread would per- 
form one such operation. Suppose that the evaluation requires 
a truncation parameter p = 12, as required for high-accuracy 
simulations; see for example [19]. In that case, each matrix has 
a size of 2p 2 = 288, requiring 2304 bytes of storage. Thus, in 
the gpu shared memory of 16 kB one could fit a maximum of 
6 such matrices, which implies that a maximum of 6 threads 
could be run in one multi-processor at a time. Clearly this num- 
ber of threads is much too small — as discussed in §3.2, more 
than one hundred threads per sm are needed in order to hide 
memory transfer latencies. Alternatively to storing the transfor- 
mation matrix in shared memory, it is feasible to use Equation 
(8) to compute the matrix elements as the matrix-vector multi- 
plication takes place (also known as matrix-free format). 

When performing the matrix-vector product in matrix-free 
format, one can "traverse" the matrix such that the matrix 
terms are efficiently calculated by reusing the computations per- 
formed to obtain the previous terms. Therefore, one can reuse 
a part of both the binomial and the power terms by traversing 
the diagonals of the matrix, as sketched in Figure 5(a). This 
represents a strategy for matrix creation and multiplication that 
is efficient from the point of view of operations that it performs, 
in both the gpu and cpu architectures. Moreover, for the gpu, by 
assigning the computations of one matrix- vector multiplication 
to only one cuDA-thread, thread synchronization is not required. 
Consider now that each thread block performs the transforma- 
tions of only one me for all the interaction list, accounting for 
an estimated 27 M2L transformations. During computation, the 



results of the transformation are stored in shared memory. With 
a maximum of 8 concurrent thread blocks per multi-processor 
in the GT200 gpu, we have a maximum of 27 X 8 = 216 active 
threads running on the multiprocessor. This nvidia gpu allows a 
maximum of 512 threads in each thread block and 1024 active 
threads per sm [3, p. 8], and thus the approach just described 
could achieve less than one third of the maximum thread ca- 
pacity of each sm, without considering the register usage per 
thread. When the translation of an me is finished, one needs 
to move the result from shared memory to global memory, in 
the form of one le consisting of 2p floats. Hence, transform- 
ing one me for all the interaction list results in the movement 
of 27 x 2p floats. For instance, translating one me of p = 12 
results in 648 floats that need to be stored in global memory, 
which at the hardware level results in a minimum of twenty 
128-byte and one 32-byte memory transactions (see §5.3 for 
a more complete discussion on memory management). When 
transferring the results to global memory, the memory transac- 
tions are executed by the thread-block one after the other, re- 
sulting in a stage where only memory transactions take place. 
A considerable disadvantage of such a memory-intensive stage 
is that multithreading is less effective in hiding the memory la- 
tency when the number of active threads is small. Nevertheless, 
our implementation using this approach sustained 20 Gop/s and 
was able to perform 2.5 million translations per second, using a 
single CI 060 tesla card. 

The performance of the implementation just described is very 
much below the theoretical peak for both throughput and band- 
width of a C1060 tesla card. An assessment of the cuda kernel 
implementation revealed that its main limitation was the inef- 
ficiency of the memory transfers to and from global memory. 
This was due to two problems: first, many of the memory trans- 
fers were non-coalesced (i.e. adjacent threads reading adjacent 
memory values, see §5.3 for a full explanation), thus the effec- 
tive bandwidth used by the kernel was very low when compared 
against the theoretical performance of the card; and second, the 
kernel did not have enough active threads working per sm to 
effectively hide the latency of the memory transfers, making 
the efficiency of the kernel implementation suffer. These two 
problems can be tracked down to a flaw in the design of our al- 
gorithm: using only one cuda thread to perform a single transla- 
tion. The resulting kernel was resource-intensive since the diag- 
onally traversing matrix-free multiplication required too many 
variables to store the state of the computations and to control 
the flow of the program, thus limiting the total number of active 
threads per sm. Additionally, having only one thread to per- 
form each translation made it difficult to use multiple threads to 
perform collaborative memory transactions. 

Desirable features of a more optimized kernel would be for 
it to be: (;') simpler, requiring less variables to store the state 
of its execution, thus allowing more threads to be active at a 
time; (ii) more parallel, allowing the use of more concurrent 
threads, and to use coordinated collaboration between threads 
for performing memory transfers. 

Now, let us examine again Equation (8). A more parallel al- 
ternative to traverse the matrix is to have multiple cuda threads 



assigned to traversing different rows of the matrix, as each row 
can be computed concurrently with each other; see Figure 5(b). 
As with the diagonal-traversal but to a more moderate extent, 
the row-traversal can also be implemented so that it reuses a 
part of both the binomial and the power terms. 

Let us consider for a moment a straightforward implementa- 
tion to obtain the powers in Equation (8) shown in the code frag- 
ment in Figure 6. What characterizes this simple approach is 
that the number of iterations per thread will depend on (k+n+1), 
which takes a different value for every thread. This straightfor- 
ward implementation would have two problems: first, threads 
within a thread-block naturally stop at different points provok- 
ing the threads within a warp to diverge; second, loop unrolling 
would be unavailable as the compiler can only effectively unroll 
loops with known trip counts. 

In our implementation, we separated the computation of 
powers in Equation (8) into two steps in order to minimize 
thread divergence, and to use compiler optimizations such as 
loop unrolling. In the first step of the calculation, the term 
(Xi - Xj)" +1 is computed by each thread; and in the second step, 
the next k (xi - xj) factors, common to all threads, are mul- 
tiplied. When computing the first step, we use a for-loop but 
with a fixed number of iterations (P — 1, with P the variable 
containing the truncation level of the me) and an conditional 
for computing only up to the power relevant to the thread; see 
the code fragment in Figure 7. This approach minimizes thread 
divergence at the cost of a small overhead of evaluating a few 
extra empty iterations. However, the overhead is minimal when 
compared to the performance that would be lost due to thread 
divergence. In contrast to the first step, the second step con- 
sists of homogeneous computations for all threads, therefore it 
can be easily implemented using a for-loop with a known trip 
of k. The second step achieves high performance as it can be 
loop-unrolled and further optimized by the compiler. For a dis- 
cussion on loop unrolling see §5.4. 

4.1.2. Discussion of the M2L gpu implementation and results 

One advantage of having more available threads per thread- 
block, is that all the memory transactions that take place to and 
from global memory can be organized so that threads in the 
same warp access sequential memory locations. This allows 
the gpu to optimize the memory transfers by grouping together 
memory transactions to the same segment in global memory. 
In the row-traversal kernel, there are two cases when access- 
ing global memory: the first case occurs when loading the 
multipole expansion coefficients from global memory to shared 
memory. The usage of a sequential memory layout of the me 
coefficients allows efficient memory transactions by warps of 
threads. The second case occurs when the threads finish com- 
puting one local expansion coefficient, and here each thread di- 
rectly saves the state into global memory. This global write is 
sequential between the threads, as at any given time a sequen- 
tial number of coefficients are being computed by a sequential 
set of threads. 

From a resource utilization point of view, each thread uses 
a small amount of resources: the shared memory usage is lim- 
ited to storing the me information that is read at the start of the 
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int k = 12; 

int n = threadldx.x; 

float ac , bd ; 
tknl_r = t_r ; 
tknl_i = t_i ; 

for (m = 1; m < k+n+1; m++) // tknl = tknl * t 
{ 

ac = tknl_r ; 

bd = tknl_i; 

tknl_r = ac * t_r - bd * t_i; 

tknl_i = ac * t_i + bd * t_r; 

} 



Figure 6: Code snippet for computing the complex power , with t = t% + i * Jg, This code shows a for-loop with a variable number of iterations. 



#define P 12 
int m ; 

float ac , bd ; 
float tnl_r = t_r; 
float tnl_i = t_i ; 

#pragma unroll 

for (int m = 1 ; m < P ; m++) 

{ 

if (threadldx.x >= m) // tnl = tnl * t 
{ 

ac = tnl_r ; 
bd = tnl_i ; 

tnl_r = ac * t_r - bd * t_i; 
tnl_i = ac * t_i + bd * t_r; 

} 



Figure 7: Code snippet for computing the complex power f' +1 , with ? = % + (* %. This code shows a for-loop where the number of iteration has been fixed at 
compile time to P - 1 , and it uses a conditional for computing only the relevant power to each thread. 



10 



#of 
terms 


#of 
trans. 


M2L kernel 
[seconds] 


Reduction 
[seconds] 


Data HtD 
[seconds] 


Data DtH 
[seconds] 


OPS. 
[Giga] 


B. 

[GB/s] 


Mill, trans, 
per sec 


8 


2160 


5.79e-05 


3.48e-05 


6.70e-05 


3.10e-05 


125.27 


2.60 


37.28 


8 


9072 


9.30e-05 


5.51e-05 


1.07e-04 


4.01e-05 


327.82 


6.81 


97.57 


8 


36720 


2.69e-04 


1.41e-04 


3.59e-04 


8.99e-05 


458.77 


9.53 


136.54 


8 


147312 


9.66e-04 


4.91e-04 


1.27e-03 


2.80e-04 


512.35 


10.65 


152.49 


8 


589680 


3.69e-03 


1.91e-03 


4.40e-03 


8.29e-04 


537.36 


11.17 


159.93 


8 


2359152 


1 .44e-02 


7.58e-03 


1.65e-02 


3.10e-03 


548.72 


11.40 


163.31 


12 


2160 


7.41e-05 


3.48e-05 


6.91e-05 


2.69e-05 


185.97 


2.93 


29.13 


12 


9072 


1.60e-04 


5.79e-05 


1.17e-04 


4.41e-05 


362.02 


5.71 


56.71 


12 


36720 


5.11e-04 


1.45e-04 


4.09e-04 


1.28e-04 


458.60 


7.24 


71.84 


12 


147312 


1.90e-03 


5.17e-04 


1.37e-03 


4.02e-04 


493.93 


7.79 


77.37 


12 


589680 


7.44e-03 


2.01e-03 


4.87e-03 


1.14e-03 


506.32 


7.99 


79.31 


12 


2359152 


2.95e-02 


8.00e-03 


1.78e-02 


4.51e-03 


511.06 


8.06 


80.05 


16 


2160 


9.39e-05 


3.48e-05 


7.30e-05 


2.91e-05 


236.93 


3.03 


22.99 


16 


9072 


2.26e-04 


5.70e-05 


1.31e-04 


5.20e-05 


413.58 


5.28 


40.14 


16 


36720 


7.68e-04 


1.52e-04 


4.31e-04 


1.62e-04 


492.69 


6.29 


47.82 


16 


147312 


2.94e-03 


5.36e-04 


1.47e-03 


5.15e-04 


515.93 


6.59 


50.07 


16 


589680 


1.16e-02 


2.08e-03 


5.19e-03 


1.55e-03 


524.79 


6.70 


50.93 


16 


2359152 


4.61e-02 


8.30e-03 


1.89e-02 


5.85e-03 


526.90 


6.73 


51.14 



Table 2: Results of the multipole-to-local computation on the nvidia tesla gpu. Each row entry of the table presents the results of a single test run. The description 
of the columns follows, from left to right: number of terms computed, number of translations performed, gpu execution time for M2L kernel, gpu execution time for 
reduction, time for data transfer from host to device (HtD), time for data transfer from device to host (DtH), number of giga-operations per second (OPS.) for M2L 
kernel, effective bandwidth (B.) utilization for M2L kernel, metric of M2L translations per second performed (in millions). 
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8 


9072 


7.61e-05 


3.81e-05 


9.92e-05 


2.41e-05 


400.79 


8.33 


119.28 


8 


36720 


2.52e-04 


1.25e-04 


2.93e-04 


4.89e-05 


489.58 


10.17 


145.71 


8 


147312 


9.41e-04 


4.96e-04 


8.46e-04 


1.32e-04 


526.11 


10.93 


156.58 


8 


589680 


3.70e-03 


2.02e-03 


2.71e-03 


5.79e-04 


535.66 


11.13 


159.42 


8 


2359152 


1.47e-02 


8.19e-03 


9.33e-03 


2.55e-03 


538.79 


11.20 


160.35 




















12 


9072 


1.34e-04 


4.01e-05 


9.70e-05 


2.79e-05 


432.23 


6.82 


67.71 


12 


36720 


4.87e-04 


1.27e-04 


2.65e-04 


5.70e-05 


481.27 


7.59 


75.39 


12 


147312 


1.87e-03 


5.03e-04 


1.23e-03 


2.98e-04 


502.67 


7.93 


78.74 


12 


589680 


7.42e-03 


2.04e-03 


2.86e-03 


7.26e-04 


507.56 


8.01 


79.50 


12 


2359152 


2.96e-02 


8.31e-03 


9.34e-03 


2.24e-03 


509.26 


8.03 


79.77 


16 


2160 


6.60e-05 


2.22e-05 


5.39e-05 


2.00e-05 


337.01 


4.31 


32.71 


16 


9072 


1.88e-04 


3.89e-05 


1.17e-04 


3.29e-05 


497.56 


6.36 


48.29 


16 


36720 


7.05e-04 


1.31e-04 


3.41e-04 


7.70e-05 


536.68 


6.86 


52.08 


16 


147312 


2.73e-03 


5.10e-04 


1.27e-03 


2.97e-04 


556.61 


7.11 


54.02 


16 


589680 


1.09e-02 


2.08e-03 


3.07e-03 


9.81e-04 


559.91 


7.15 


54.34 


16 


2359152 


4.33e-02 


8.43e-03 


9.90e-03 


2.94e-03 


561.51 


7.17 


54.49 



Table 3: Results of the multipole-to-local computation on the nvidia fermi gpu. Each row entry of the table presents the results of a single test run. The description 
of the columns follows, from left to right: number of terms computed, number of translations performed, gpu execution time for M2L kernel, gpu execution time for 
reduction, time for data transfer from host to device (HtD), time for data transfer from device to host (DtH), number of giga-operations per second (OPS.) for M2L 
kernel, effective bandwidth (B.) utilization for M2L kernel, metric of M2L translations per second performed (in millions). 
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block execution, and all other memory is stored in the registers. 
Furthermore, as each thread only computes a single complex 
coefficient of a local expansion at a time, the overall register 
usage stays low. 

The crucial features of the row-traversal formulation of the 
kernel that resulted in remarkable performance gains are sum- 
marized as follows: 

1 . Increased the number of threads per block (in fact, we can 
have any number of threads). 

2. Avoid thread branching for threads in the same warp. 

3. Loop-unrolling (when done manually, performance was 
even greater). 

4. Reduced accesses to global memory. 

5. When accessing global memory, we ensure it is coalesced. 

The final step taken in the optimization of the gpu formula- 
tion of the kernel was to overlap memory movements to and 
from global memory with computational work. The result is a 
high-performance algorithmic formulation tuned for the gpu ar- 
chitecture, that attains 548 Giga operations per second on tesla 
and up to 561 Giga operations per second on fermi. 

In Table 2 and Table 3 we present performance results of sev- 
eral numerical experiments for the M2L kernel using tesla and 
fermi gpus, respectively. The numerical tests for the M2L im- 
plementation are controlled by two parameters: the number of 
terms of the multipole expansion, and the number of expansions 
to be translated. The intensity of the calculations is given by 
the number of terms of the me, and the size of the problem is 
given by the number of expansions to translate. Both tables 
report the time spent on the kernel execution and data trans- 
fer between host and gpu, and performance metrics for the M2L 
kernel. Performance metrics reported correspond to the number 
of giga-operations per second, effective bandwidth utilization, 
and the number of M2L translations per second performed in 
millions. 

Looking at the results presented in Table 2 and Table 3, con- 
sider that the M2L is a compute-bound kernel. As such, it is 
limited by the maximum throughput of the gpu. Therefore, for 
the gpu to achieve its maximum throughput performance, it re- 
quires a test run large enough to completely utilize all of the 
streaming processors. This is reflected in the results shown, as 
the M2L kernel execution on the gpu reports its best throughput 
on the test runs that have more me terms and larger problem 
sizes. We also report the number of translations per second 
as a "real world" performance metric that only depends on the 
test problem, i.e., it reflects wall-clock time to solution. Note 
that we obtained comparable single-precision performance be- 
tween the tesla and fermi architectures. We attribute this to the 
fact that our application does not benefit from the new features 
introduced in fermi, such as improved double precision perfor- 
mance, and LI & L2 cache. 

4.2. Fast Gauss transform 

We will propose several improvements beyond simply rewrit- 
ing cpu code for cuda, and later, in §5 we will present actual 



code used for these accelerations. Firstly, instead of using the 
entire fgt algorithm, we will focus on the evaluation of the Her- 
mite series. According to many test runs with a full cpu imple- 
mentation of the fgt, this part of the algorithms takes in the 
order of 90% of the total run time, so any optimizations here 
will have an appreciable effect on the overall speed of the algo- 
rithm. These optimizations will be placed in the context of the 
hardware metrics we introduced earlier in this section. 

The Hermite series evaluation takes a set of series coeffi- 
cients, A a , a target point (in d dimensions), y, the center of 
the source cluster, sg and returns the influence of that source 
cluster at y. The equation form is given previously by (6), but 
is presented again here for ease of reading: 

G(y) = YA a hJ y -^]. (9) 

4.2.7. Reformulation of the fgt algorithm for gpu 

If one were to make no changes in the fgt, then the gpu im- 
plementation would follow the same logic as that used for a 
normal cpu implementation. This means that for every source- 
target cluster interaction, a single cuda kernel would be called. 
Each call uses the coefficients for this cluster along with its cen- 
ter point. Each thread would then take care of a single target 
point and compute the potential for that point from the Her- 
mite coefficients. This approach, while naive, does have some 
advantages; it allows us to keep all variables needed by more 
than one thread in shared memory, while also ensuring concur- 
rency and homogeneity are high. However, such an approach 
makes no deliberate attempt to leverage the advantages of cuda. 
This is especially evident as the data sets being worked on are 
small — thus the overhead from memory accesses is going to be 
large compared to the run time, slowing down the implementa- 
tion significantly. 

Clearly, the approach described above will not give substan- 
tial acceleration on the hardware. Thus, we explore the algo- 
rithmic changes that will be necessary to improve performance. 
One first possibility is to work with multiple series expansions 
from multiple clusters at once. In this way, instead of sending 
the coefficients for a single source cluster, A a and its associated 
cluster center, sg, we amalgamate all necessary coefficients and 
cluster centers that must be evaluated at our target points into 
two large arrays, 

A a = [Af\A«\ ■■-,A<£] and, (10) 

s B = (ID 

In this way, we try to "hide" the time required to transfer all of 
the data to the device by increasing the computational intensity. 
We do this by performing a larger amount of work for each 
of the target points, which, once again, are assigned a single 
thread each, ensuring we keep the high levels of concurrency 
we would have enjoyed in a naive implementation. 

This iteration falls short on a couple of new points: firstly, 
the number of values to be read into shared memory are too 
many. The size of the amalgamated coefficients, cluster centers 
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and other necessary shared values such as a, can exceed the 
limited amount of shared memory (16kB) available. Secondly, 
to maximize concurrency and occupancy, we choose to let each 
thread read only a single value and perform collaborative mem- 
ory operations (as discussed in Section §5.5). The number of 
threads per block is hardware-limited to 512, hence the number 
of shared values that can be read from global memory cannot 
be greater than this before we potentially compromise the ho- 
mogeneity of the implementation. This indicates that we either 
need to store more values in global memory, take the perfor- 
mance hit and relinquish data-locality, or we must once again 
change our kernel. 

4.2.2. Final algorithm 

Taking into account the previously mentioned failings, our 
current implementation combines the best parts of both previ- 
ous attempts to obtain very high performance. Instead of having 
one single kernel call with all the data needed, or multiple small 
calls as in a naive implementation, we split the evaluation into 
smaller calls. Each call then evaluates the influence of multiple 
source clusters, such that the amount of data needed is small 
enough to fit into shared memory, giving us spatial and tempo- 
ral data-locality, and each thread need only read a single value 
into shared memory in order to ensure homogeneity. In this 
way, we ensure that all data needed across threads in a block 
(especially the series coefficients) is kept in shared memory, re- 
ducing the number of costly transfers between the device and 
host. This gives the gpu enough data to take full advantage of 
its features, while ensuring that all memory accesses are as fast 
as possible. 

A further advantage of this approach is that spatial and tem- 
poral data-locality can be maximized. However, this refactoring 
is only one step of the optimization process. Within our eval- 
uation of the Hermite series, we must perform other tasks, in 
particular evaluating Hermite polynomials for every appropri- 
ate source cluster at every target point. 

4.2.3. Hermite polynomial evaluation 

To evaluate polynomials in cuda, we must first have some 
way of representing and working with them. Representing 
polynomials can be done with an ordered array of coefficients, 
but evaluation requires more work. There already exists an 
algorithmically optimal method for the evaluation in Horner's 
rule, and this method was implemented as a function in cuda. 
Horner's rule allows us to evaluate a polynomial in O(n) time 
[20], where n is the degree of the polynomial. It does this by 
writing the evaluation in the form: 

n-i 

P(x) = ^ a,x' = flo + x ( a i + x ( a 2 + • • ■ + x(a„_i)) • • • )■ (12) 

i=0 

Our first version of this function used a loop over the polyno- 
mial coefficients, and was implemented as a device-only inlined 
function. This use of loops involves the evaluation of condition- 
als, and consequent potential branching of code. This branch- 
ing causes significant performance impact in cuda, and so this 



method needed to be re-implemented in some fashion to elim- 
inate (if possible) all of the branching code. Our solution to 
this problem was to manually unroll the evaluation loop, a pro- 
cess outlined in more detail in §5. This technique improved the 
computational intensity and thus throughput of the algorithm, 
and as an added incentive, reduced the number of registers that 
were being used, allowing us to optimize the occupancy more 
easily. 

4.2.4. Shared memory use 

The optimal use of shared memory is vitally important to 
overall performance. As outlined in §4, the access time for 
shared memory is around lOOx faster compared to global mem- 
ory. Thus, every variable that needs to be used by more than one 
thread, or even accessed more than once, should be copied into 
either shared memory (in the case of variables to be accessed 
by multiple threads) or local registers in order to obtain both 
spatial and temporal data locality. In the case of our implemen- 
tation, the Hermite series coefficients, A„ , the cluster centers, 

, the multi-index values themselves, a, all need to be copied 
into shared memory. 

4.2.5. Computing on-the-fly 

A corollary of the access time for global memory is that one 
can have a situation where it is easier to calculate certain val- 
ues on-the-fly, rather than read them from global memory. One 
such case is when calculating factorials. The standard method 
of calculating factorials by recursion is so computationally light 
that we can assign each factorial value to a single thread. This 
is preferential to the overhead of copying pre-computed values 
from global memory, and can be done while other threads are 
copying their values from global to shared memory. Addition- 
ally, all threads can calculate all values, but only save one. This 
allows us to unroll loops for greater performance and to main- 
tain zero-branching. 

4.2.6. Occupancy 

Equally distributing work and memory accesses across 
threads is a necessary step for maximizing the occupancy of our 
implementation, as is reducing the number of registers used by 
each individual thread. These considerations allow us to make 
use of the highest possible number of threads at all times. 

While the work performed by each thread is identical, ensur- 
ing homogeneity, memory transactions to global memory also 
need to be balanced. We achieve this balance by doing col- 
lective memory operations for retrieving data that is used by 
multiple threads. Each thread is then responsible for copying a 
single value from global memory into shared memory, and the 
index of the thread within its block dictates which value (poly- 
nomial coefficient, cluster center x or y value, etc.) the given 
thread will copy. For a more complete discussion, see §5.5. 

4.2.7. Results 

We present results of testing both a cuda and a standard 
cpu code implemented in C++. Attempts have been made to 
optimize this C++ code, including parallelization for shared 
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memory systems using openMP, loop unrolling (using the 
-f unroll-loops option from gcc) and evaluating hard-coded 
polynomial coefficients using Horner's rule. While this imple- 
mentation is still not optimal, it offers a comparable code. This 
also explains the greater than lOOx speedups observed, as these 
should be questionable when comparing similarly optimal cpu 
and gpu codes. 

Timings for these tests were obtained by assuming that the 
kernels would be called within part of a greater algorithm. 
Thus, time taken to assign / populate target points and clus- 
ter co-efficients was not counted. In both the cpu and gpu cases 
these values would have been read from a file and computed by 
a separate method, respectively. 

Tables 4 and 5 show comparisons between cpu and gpu codes 
running on identical datasets with a number of evaluation points 
ranging up to over 6 million. The truncation variable for the 
Hermite series is varied between p = 3 and p = 9. For table 
4, the cpu implementation is run on between 1 and 4 cores of 
an Intel Penryn processor, and the gpu code on a single nvidia 
tesla CI 060 card. For table 5, the processor is an Intel Ne- 
halem and the gpu is an nvidia fermi C2050. In both cases 
only the particles evaluated per second are compared (PPS in 
the tables), with the additional reporting of the number of giga- 
operations (GOPS) performed on the gpu given as a measure 
of the total utilization of the card. Useful to note is that the 
cpu code scales nearly linearly with the number of cores, show- 
ing it is at least fairly efficient, and that the greatest speedup 
from the gpu with respect to 4 cores on the gpu is around 25 
times on the Nehalem / fermi system and 30 times for Penryn / 
tesla. Given the good multi-processor scaling of the cpu code, 
this represents a roughly 100 - 120x speedup over a single cpu 
core. The greater than lOOx speedup encountered is above what 
is commonly quoted as possible for a compute-bound applica- 
tion, but can be explained due to the possibility of many more 
optimizations (such as SSE operations) being implemented in 
the cpu code. Particle evaluations per second (PPS) are also 
shown, to give a more "physical" representation of the speed 
of the code. While these results may seem high at first glance, 
they are within the theoretical peak of both cards (1033 Gops 
for the fermi card), and we have verified by examining the inter- 
mediate .ptx files that fused multiply-add instructions have been 
used throughout the main inner loop that evaluates the Hermite 
polynomial. Given that our calculation is entirely throughput- 
bound, this ensures that we obtain very good performance. 

The results presented show a significant disparity in the num- 
ber of giga-operations performed, especially in the case of 
p = 12, where the peak for the fermi card is above 1000 GOPS, 
while the tesla card only manages 700. The reason for this 
is likely the large number of integer operations required in the 
Hermite evaluation kernel in order to resolve memory locations. 
For instance, the full version of the code-unrolling snippet in 
Figure 8 requires only 2 floating point operations (+, x), which 
can be done in a single fused multiply-add (FMA) instruction, 
while it requires an additional 2 integer operations (+, x) to ob- 
tain the correct coefficient memory location. In fact, for every 
floating point operation, on average, an integer operation must 



also be performed. Referring to Table 5-1 in nvidia's program- 
ming guide [18], we see that for the tesla gpus (Compute Ca- 
pability 1.3), 32-bit integer multiplies and multiply-adds both 
require multiple instructions, giving an integer throughput sig- 
nificantly lower than that for floating point operations. From 
the same table, we see that the fermi card (Compute Capability 
2.0) has the same throughput for these operations as for floating 
point operations. This significant increase in integer throughput 
removes a bottleneck in the code, and thus allows to improve 
our performance over the previous generation of cards. 

5. GPU implementation discussion 

In order to assist other practitioners in reaching these lev- 
els of performance, we now discuss the design strategies and 
in some cases demonstrate optimized fragments of code used 
within our kernels. These methods have been tuned for perfor- 
mance and are suitable for adaptation and use in any applicable 
cuda implementation. 

5.7. Thread execution branching 

A code design consideration that greatly impacts perfor- 
mance is branching of threads during execution due, for ex- 
ample, to data-dependent conditionals. In the cuda architec- 
ture, streaming multi-processors are able to manage a multitude 
of concurrent threads, but they do so in groups of 32 parallel 
threads, called warps. All threads in a warp start together, but 
they may branch as they execute. If they do, the warp actually 
executes in serial mode the different branch paths, while threads 
that do not take the particular path wait, effectively idling and 
wasting cycles. Thus, all threads in the warp converge after 
the branching and then continue execution together. For this 
reason, branching can be a serious hurdle for obtaining perfor- 
mance. 

5.2. Multithreading 

The cuda architecture relies on no-cost multithreading to hide 
the high latency of memory accesses to global memory. A mul- 
tiprocessor of a gt200 gpu chip can have up to 1024 active 
threads and 8 active thread blocks at any given time. However, 
active threads on an sm share its limited resources (registers and 
shared memory). In practice, for a given kernel implementa- 
tion, the kernel's resource utilization defines the limit on the 
number of active threads per sm. The ideal for a memory-bound 
application is to have as many active threads as possible, to hide 
the latency of the memory transactions. In order to characterize 
the efficiency of the sm utilization, the concept of Occupancy [3] 
was introduced and defined as: 



Occupancy = 



Number of active threads 



(13) 



Maximum threads available 
The number of active threads will depend on a few param- 
eters: on one hand, the kernel implementation will define the 
number of resources used per thread and per thread block, on 
the other hand, the resources per sm are limited by the hardware. 
The only parameter that is defined by the user at run time is the 
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N 


P 


PPS 


PPS 


PPS 


GOPS 


PPS 






(1 Core) 


(2 Cores) 


(4 Cores) 


(GPU) 


(GPU) 


25600 


5 


1.68e+05 


3.36e+05 


6.59e+05 


461.3 


1.55e+07 


102400 


5 


1.68e+05 


3.35e+05 


6.60e+05 


539.1 


1.81e+07 


409600 


5 


1.69e+05 


3.34e+05 


6.68e+05 


563.2 


1.89e+07 


1 1 Ci A f\f\ 

1638400 


5 


1.69e+05 


3.37e+05 


6.72e+05 


570.7 


1.91e+07 


6553600 


5 


1.68e+05 


3.33e+05 


6.72e+05 


572.9 


1.92e+07 


25600 


9 


3.99e+04 


8.02e+04 


1.59e+05 


549.9 


3.93e+06 


102400 


9 


4.03e+04 


7.93e+04 


1.58e+05 


637.1 


4.55e+06 


409600 


9 


4.03e+04 


8.01e+04 


1.59e+05 


663.9 


4.74e+06 


1638400 


9 


4.03e+04 


8.03e+04 


1.61e+05 


671.6 


4.80e+06 


6553600 


9 


4.02e+04 


8.03e+04 


1.60e+05 


673.7 


4.81e+06 


25600 


12 


2.05e+04 


4.07e+04 


8.08e+04 


257.1 


8.30e+05 


102400 


12 


2.06e+04 


4.08e+04 


8.16e+04 


490.8 


1.59e+06 


409600 


12 


2.04e+04 


4.07e+04 


8.18e+04 


649.8 


2.10e+06 


1638400 


12 


2.05e+04 


4.09e+04 


8.17e+04 


688.8 


2.23e+06 



Table 4: Results on both a multi-core processor and gpu for the Hermite evaluation kernel. Presented are Particles per second (PPS) for all cases, and on the gpu the 
total number of giga-operations (GOPS) obtained are also shown. Runs were made on a Penryn series Intel cpu and nvidia tesla C1060 gpu 
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P 


PPS 

(1 Core) 


PPS 

(2 Cores) 


PPS 

(4 Cores) 


GOPS 
(GPU) 


PPS 
(GPU) 




5 












102400 


5 


2.50e+05 


4.93e+05 


9.39e+05 


780.0 


2.62e+07 


409600 


5 


2.54e+05 


4.92e+05 


9.31e+05 


815.3 


2.73e+07 


1638400 


5 


2.54e+05 


4.98e+05 


9.35e+05 


828.7 


2.78e+07 


6553600 


5 


2.58e+05 


5.00e+05 


9.76e+05 


830.3 


2.78e+07 


25600 


9 


7.12e+04 


1.38e+05 


2.64e+05 


786.2 


5.62e+06 




9 












409600 


9 


7.19e+04 


1.37e+05 


2.78e+05 


893.0 


6.38e+06 


1638400 


9 


7.21e+04 


1.39e+05 


2.78e+05 


905.9 


6.47e+06 


6553600 


9 


7.29e+04 


1.39e+05 


2.80e+05 


907.3 


6.48e+06 


25600 


12 


3.33e+04 


6.62e+04 


1.20e+05 


874.5 


2.83e+06 


102400 


12 


3.33e+04 


6.65e+04 


1.27e+05 


957.1 


3.09e+06 
















1638400 


12 


3.38e+04 


6.52e+04 


1.29e+05 


1010.8 


3.27e+06 



Table 5: Results from the Hermite evaluation kernel on multi-core processors and gpu. Particles per second (PPS) are presented for all cases, and the total number 
of giga-operations (GOPS) obtained on the gpu are also given. Runs were made on a Nehalem series Intel cpu and nvidia fermi C2050 gpu 
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number of threads used per thread block. It is generally the case 
for memory-bound applications that by maximizing the occu- 
pancy of the multiprocessor we achieve better performance. In 
our application, we maximized the gpu occupancy in two ways: 
the most straightforward approach was to select a thread-block 
size that maximizes occupancy, one tool to do this is provided 
by nvidia and is named "occupancy calculator" 2 ; the second 
approach was to hand-tune the implementation so that it reuses 
the maximum number of variables as possible. The last strat- 
egy has the most potential to improve occupancy by reducing 
the number of registers used, however it is a trial and error pro- 
cedure, as the actual resource utilization ultimately depends on 
the compiler and sometimes less variables do not translate in 
less registers being used. 

5.3. Memory management 

Memory management needs to be explicitly provided by the 
programmer. This is a challenge, as the efficiency of the mem- 
ory transactions directly depends on the algorithmic design and 
implementation. Issues that need to be considered are: 

> Small and fast shared memory. The limited size of the 
shared memory, at 16 kB for the gpu we used, clearly im- 
poses a restriction on the amount of fast storage available. 

> Large and slow global memory. Global memory is char- 
acterized by the large memory space, with up to 4 GB of 
memory, but with very high latency of access from the gpu 
chip (each transaction taking between 400 and 600 clock 
cycles). 

> Shared memory conflicts. Shared memory is physically di- 
vided into 16 blocks, and each block can do one memory 
transaction per clock (read, write, or broadcast). There- 
fore, if more than one thread of the same warp accesses 
data in the same memory block, the thread memory oper- 
ations are queued and executed sequentially. 

> Efficient global memory accesses. Global memory is phys- 
ically accessed as 32-, 64-, or 128-byte segments by the 
hardware. Therefore, every memory transaction that is is- 
sued will read or write to a whole segment, regardless of 
whether only one element of the segment is read or writ- 
ten into the segment. In this case, much of the memory 
bandwidth would be inefficiently used. Efficient memory 
transactions that are aligned and efficient per thread warp 
are commonly referred to as coalesced memory transac- 
tions. 

> Memory camping. Global memory is divided into physical 
blocks. The total bandwidth of global memory is calcu- 
lated as the aggregated bandwidth of all the blocks. In or- 
der to effectively use the bandwidth, memory transactions 
need to be spread across all blocks. In our implementa- 
tion, we addressed this by designing the kernels so that the 
thread blocks evenly access the global memory locations. 



2 http://news. developer.nvidia.com/2007/03/cuda_occupancy_. html 



5.4. Loop unrolling 

In several parts of our kernels, we require iterating over a 
loop to calculate values, for instance in evaluating multipole 
expansions in the fmm and Hermite polynomials in the fgt. 
Loop-unrolling is a well-known process whereby a loop over 
some pre-determined number of iterations, say N, is replaced 
with the loop's body code, repeated TV times. This technique 
trades a larger compiled binary size for increased performance. 
This kind of optimization can be performed automatically by 
the cuda compiler, but in our testing, we found that manually 
unrolling the code when possible resulted in both increased per- 
formance, and fewer registers being used, allowing us to more 
easily maximize the occupancy of our kernels. 

The specific example we demonstrate is the evaluation of 
Hermite polynomials in the fgt. This is done by implement- 
ing Homer's method, as previously described. Consider the 
code example shown in Figure 8. This code fragment shows 
the loop-based implementation, with *H being a pointer to an 
array of polynomial coefficients, x the desired evaluation point, 
and poly_len denoting the degree of the polynomial to be eval- 
uated. The code fragment in Figure 9 demonstrates the same 
loop, but now unrolled for the instance where the polynomial 
degree is 5. 

While this particular piece of code has been generated by 
hand, it would be easy to have a separate piece of code to 
automatically generate at compile time the correct number of 
lines for a given number of terms. This would ensure the best 
possible performance, while removing the tedious and possibly 
error-prone manual adding of lines for more terms. 

The optimization just described deals with speeding up com- 
putations, but we still need to get the data to compute efficiently. 
Thus, we now discuss balancing reads and writes from global 
memory. 

5.5. Balancing global reads/writes across threads 

To maintain an equal amount of load across all working 
threads, it is important to ensure that no thread reads or writes 
disproportionately more to the global memory than any other. 
The main area where this problem appears is during the trans- 
fer of commonly-used variables from global to shared memory. 
A naive method may result in some threads reading multiple 
values into shared memory, while others read none, an obvi- 
ously unsatisfactory situation. Thus, we propose the following 
system: the maximum number of values to be read into shared 
memory is the same as the number of threads within a block. 
The position of a thread within a block dictates which value 
it will read. This ensures that no thread reads more than one 
value. The example shown in Figure 10 is taken from our fgt 
cuda code. 

The code snippet divides our threads so that as many values 
are read simultaneously as possible, with the type of variable 
to be read determined by the thread's index within its block. 
The first group of threads will calculate factorial values; the 
second will read polynomial coefficients and the third group 
will read the expansion center. In the second part of the code 
snippet, lines 29 to 49, the code segment diverges the reads 
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__device__ float poly _e val ( f loat *H, float x, int poly_len) 
{ 

int i ; 

float y = H [0] ; 

/* LOOP TO BE UNROLLED */ 

for (i=l; i < poly_len; i=i+l) 

{ 

y=H[i] + x*y ; 

} 

return y ; 
y = poly_e val (H , x , num_terms ) ; 

Figure 8: Code snippet showing a loop-based implementation of the polynomial evaluation. 



/* Manually unrolled code */ 

y = H [0] ; 

y = H[l] + x*y; 

y = H [2] + x*y; 

y = H [3] + x*y; 

y = H [4] + x*y; 



Figure 9: Code snippet showing the unrolled loop. 



// shared memory 

__shared__ int shar ed_f act [ NUM_TERMS ] ; 
__shared__ int shared_alpha [LEN_ALPHA] ; 
__shared__ float shared_sb [2] ; 

// Read vars into shared memory 

// WARNING: Each block needs more threads than ( LEN_ ALPHA + NUM_TERMS 

if (threadldx.x < NUM_TERHS) { 
// generate factorial case 
k = 0; 
alphal = 1 ; 

for (i=l; i < threadldx.x + 1; i++) { 
alphal *= i ; 

} 

// store in shared memory 
shared_f act [threadldx . x + k] = alphal; 
} else if (threadldx.x < NUM_TERMS + LEN_ ALPHA ) { 
// read alpha into shared 
k = -NUM_TERMS; 

shared_alpha [threadldx . x + k] = alpha [threadldx . x + k] ; 
} else if (threadldx.x < NUM_TERMS + LEN_ ALPHA + 2) { 
// read sb into shared memory 
k = - NUM_TERMS - LEN_ALPHA ; 

shared_sb [threadldx . x + k] = sb [threadldx . x + k] ; 
} else { 

// default case - no read 
k = 0; 

} 

__syncthreads () ; 



Figure 10: Code snippet showing the balanced reads/writes. 
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across all threads and contiguous reads are achieved by having 
adjacent threads reading adjacent values from memory. It is 
important to note here that in earlier versions of the nvidia cuda 
compiler, "if" statements forced thread branching on a whole 
threadblock, while a "switch" statement only branched within 
a thread warp. Thus, the optimal version of the code in Figure 
10 for compilers previous to version 3.x involved splitting the 
code into two separate steps: an "if" statement to decide which 
threads read which values, and then a separate "switch" step to 
actually perform the reads. However, with compiler versions 
3.0 and above, this is no longer appropriate, and so the code in 
Figure 10 is optimal. 

This implementation has significant advantages: first, mem- 
ory transactions can be organized such that adjacent threads 
within the same warp can read/write adjacent values from/to 
global memory, therefore ensuring coalesced memory transac- 
tions. Secondly, while this example only handles reading as 
many values as we have threads, it is trivially extendable to han- 
dle as many reads as a user desires. Finally, for some memory- 
intensive kernels, it is possible to define the size of a thread 
block to obtain the best kernel performance for memory trans- 
actions, even if this implies that some threads could remain idle 
during the compute-intensive part of the kernel. 

5.6. Other design strategies 

> Keep the kernel complexity low. It is better to have a highly 
specialized kernel than a general purpose implementation. 
We consider three reasons for this: first, highly specialized 
kernels keep the resource utilization low, which in turn al- 
lows having more active threads per multiprocessor; sec- 
ond, high specialization allows for the use of specific op- 
timizations; third, general cases normally result in branch- 
ing conditionals, and this reduces the kernel performance. 

> Use many threads per thread block. Many threads can co- 
operate during memory transactions, resulting in more ef- 
ficient use of the memory bandwidth. This strategy is more 
important in the negative, i.e., for kernels that use too few 
threads it will be difficult to perform coalesced memory 
transactions. 

> Interleave computations and memory transactions. In a 
kernel, it is a good idea to interleave memory transfers 
with computational work within the same kernel, in con- 
trast with using a three-stage kernel that consists of one 
stage to load data, a second stage to perform computations, 
and a third stage for saving data. We observed that in our 
cases it was better to have smaller stages that interleave 
memory transfers (load/write) with work. In this way, the 
multiprocessor-limited resources can be used more effi- 
ciently and it is easier to hide latency by multithreading, 
as it is more difficult to overlap threads performing big 
stages than a series of small efficient stages. 

> On-the-fly calculations. As stated several times before, 
there is a significant overhead related to reading and writ- 
ing from global memory. This overhead can be so signif- 
icant that for some purposes it may be more efficient to 



re-calculate values than to store and retrieve them. A good 
example of this is found in a factorial cache. In a standard 
cpu code, it is more efficient to pre-compute the values of 
all the factorials you may need to use at the beginning of 
your code, then simply re-use these values when needed. 
In a gpu implementation however, these values would need 
to be read into shared memory for each block. Instead of 
doing this, we consider designating a set of threads in each 
block to calculate these values instead of reading from 
memory. 

6. Conclusions 

This work demonstrates that it is possible to attain close to 
the practical peak of modern gpu architectures, not just with 
embarrassingly parallel problems, but with real-world, elabo- 
rate algorithms. The process, however, is not straight-forward. 
Intimate knowledge of the architectural features is required to 
rethink the core methods such that performance can be maxi- 
mized. This familiarity with the hardware architecture, in fact, 
reveals that some algorithms will be better suited to perform 
well on the gpu. Some others will suffer some unavoidable bot- 
tlenecks due to being bound by memory bandwidth, rather than 
computation. When this happens, only moderate speedups will 
be possible on the new hardware. When a computationally in- 
tensive algorithm is properly reformulated for the gpu, however, 
two-order of magnitude speedups can be obtained. This holds 
fantastic promise for extending the capability of computing to 
solve some of the most challenging scientific problems. 

The results obtained here, sustaining over 500 Gigaop/s on 
one tesla C1060 card for fast summation algorithms, can have 
significant impact for applications using these algorithms. In 
particular, we are currently working to develop boundary el- 
ement methods which are accelerated with the fast multipole 
method. Considering the potential speedup in applications, we 
anticipate unprecedented capability for acoustics, electromag- 
netics, bioelectrostatics, and other applications of the bem and 

FMM. 

Our next challenge is to utilize the gpu kernels developed for 
this work in a truly heterogeneous environment. We will cou- 
ple these kernels to software libraries for distributed systems, 
and investigate scalability with multiple cpu cores and multiple 
gpus. Such work will offer application scientists tools for truly 
groundbreaking discovery through advanced computing. 

Codes and test scripts used to obtain the results presented in 
this paper are available from http://code.google.eom/p/cufast/. 
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